Search Result

File type detection algorithm based on principal component analysis and K nearest neighbors

YAN Mengdi, QIN Linlin, WU Gang

Journal of Computer Applications 2016, 36 (11): 3161-3164. DOI: 10.11772/j.issn.1001-9081.2016.11.3161

To address the low recognition accuracy of file-type identification based on file suffixes and file-specific features, a content-based file-type detection algorithm was proposed that combines Principal Component Analysis (PCA) and K Nearest Neighbors (KNN). First, PCA was used to reduce the dimensionality of the sample space. Then the training samples were clustered so that each file type was represented by its cluster centroids. To reduce the error caused by unbalanced training samples, a distance-weighted KNN algorithm was proposed. The experimental results show that, when the number of training samples is large, the improved algorithm reduces computational complexity while maintaining high recognition accuracy. Because the algorithm does not depend on features specific to each file type, it can be applied more widely.
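
The distance-weighted voting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that PCA and clustering have already produced low-dimensional centroids, and the function name `weighted_knn`, the inverse-distance weight 1/(d + eps), and the toy 2-D centroids are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def weighted_knn(query, centroids, k=3, eps=1e-9):
    """Classify `query` by inverse-distance-weighted voting over centroids.

    `centroids` is a list of (label, vector) pairs. The weighting formula
    1 / (distance + eps) is an assumption; the abstract does not give one.
    """
    # Rank all centroids by Euclidean distance and keep the k nearest.
    nearest = sorted((math.dist(query, vec), label) for label, vec in centroids)[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        votes[label] += 1.0 / (distance + eps)  # closer centroids count more
    return max(votes, key=votes.get)


# Toy centroids, imagined as the output of PCA followed by clustering.
# Type "B" has more centroids than type "A" (an unbalanced training set).
toy_centroids = [
    ("A", [0.0, 0.0]),
    ("B", [1.0, 0.0]),
    ("B", [1.0, 0.1]),
]

# Two of the three nearest centroids are "B", but the query lies almost on
# top of the single "A" centroid, so the weighted vote selects "A".
predicted = weighted_knn([0.05, 0.0], toy_centroids, k=3)
```

In this toy case an unweighted majority vote over the same three neighbors would return "B", which is the kind of imbalance-driven error that distance weighting is meant to reduce.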

Data deduplication in Web information integration

LIU Xueqiong, WU Gang, DENG Houping

Journal of Computer Applications 2013, 33 (09): 2493-2496. DOI: 10.11772/j.issn.1001-9081.2013.09.2493

Because traditional data deduplication methods have low time efficiency and detection accuracy, a Stepwise Clustering Data Elimination (SCDE) method was presented based on the characteristics of Web information integration. First, the whole record set was divided into subsets using both key-attribute division and the Canopy clustering technique; then the similar records in each subset were eliminated precisely. A fuzzy entity matching strategy based on dynamic weights was proposed to eliminate duplicate records accurately; it reduced the influence of missing attributes on the record similarity calculation, and company names received special treatment to improve matching accuracy. The results show that the method is superior to traditional algorithms in time efficiency and detection accuracy, with precision improved by 12.6%. The method has been applied in a forestry yellow pages system and performs well.
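
The two stages named in this abstract, Canopy-style blocking followed by weighted fuzzy matching that tolerates missing attributes, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the SCDE method itself: the key-attribute division and the special handling of company names are omitted, and the helper names, the token-Jaccard cheap similarity, the use of `difflib.SequenceMatcher`, the thresholds, the weights, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def name_token_jaccard(r1, r2):
    """Cheap similarity for blocking: Jaccard overlap of name tokens."""
    a = set(r1.get("name", "").lower().split())
    b = set(r2.get("name", "").lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def canopies(records, cheap_sim, loose, tight):
    """Canopy clustering on similarities, with loose < tight.

    Records within `loose` of a center join its canopy; records within
    `tight` are removed from the pool of future centers.
    """
    remaining = list(range(len(records)))
    result = []
    while remaining:
        center = remaining[0]
        sims = {i: cheap_sim(records[center], records[i]) for i in remaining}
        result.append([i for i in remaining if sims[i] >= loose])
        remaining = [i for i in remaining if i != center and sims[i] < tight]
    return result


def record_sim(r1, r2, weights):
    """Weighted similarity over attributes present in BOTH records.

    Weights of missing attributes are dropped and the remaining weights
    renormalized (one assumed reading of "dynamic weight").
    """
    shared = [k for k in weights if r1.get(k) and r2.get(k)]
    total = sum(weights[k] for k in shared)
    if total == 0:
        return 0.0
    score = sum(
        weights[k] * SequenceMatcher(None, r1[k].lower(), r2[k].lower()).ratio()
        for k in shared
    )
    return score / total


def find_duplicates(records, weights, threshold=0.85):
    """Compare records only within each canopy; return index pairs."""
    pairs = set()
    for canopy in canopies(records, name_token_jaccard, loose=0.3, tight=0.8):
        for x in range(len(canopy)):
            for y in range(x + 1, len(canopy)):
                i, j = canopy[x], canopy[y]
                if record_sim(records[i], records[j], weights) >= threshold:
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical yellow-page records; record 1 is missing its city.
sample_records = [
    {"name": "Green Forest Timber Co", "phone": "555-0101", "city": "Harbin"},
    {"name": "Green Forest Timber Company", "phone": "555-0101", "city": ""},
    {"name": "Blue River Paper Mill", "phone": "555-0202", "city": "Wuhan"},
]
sample_weights = {"name": 0.5, "phone": 0.3, "city": 0.2}
duplicate_pairs = find_duplicates(sample_records, sample_weights)
```

Here records 0 and 1 are flagged as duplicates because the missing city is excluded from the score rather than counted as a mismatch; scoring it as zero would push the pair below the 0.85 threshold used in this sketch.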